Abstract

The theory of orthogonal polynomials on the unit circle (OPUC) dates back 
to Szego's work of 1915-21, and has been given a great impetus by the recent 
work of Simon, in particular his two-volume book [Si4], [Si5], the survey paper
(or summary of the book) [Si3], and the book [Si9], whose title we allude
to in ours. Simon's motivation comes from spectral theory and analysis. An- 
other major area of application of OPUC comes from probability, statistics, 
time series and prediction theory; see for instance the book by Grenander 
and Szego [GrSz]. Coming to the subject from this background, our aim here 
is to complement [Si3] by giving some probabilistically motivated results. We 
also advocate a new definition of long-range dependence.

§1. Introduction

The subject of orthogonal polynomials on the real line (OPRL), at least 
some of which forms part of the standard undergraduate curriculum, has 
its roots in the mathematics of the 19th century. The name of Gabor 
Szego (1895-1985) is probably best remembered nowadays for two things: 
co-authorship of 'Polya and Szego' [PoSz] and authorship of 'Szego' [Sz4], 
his book of 1938, still the standard work on OPRL. Perhaps the key result in 
OPRL concerns the central role of the three-term recurrence relation ([Sz4], 
III.3.2: 'Favard's theorem'). 

Much less well known is the subject of orthogonal polynomials on the unit 
circle (OPUC), which dates from two papers of Szego in 1920-21 ([Sz2], [Sz3]), 
and to which the last chapter of [Sz4] is devoted. Again, the key is the appropriate
three-term recurrence relation, the Szego recursion or Durbin-Levinson
algorithm (§2). This involves a sequence of coefficients (not two sequences,
as with OPRL), the Verblunsky coefficients $\alpha = (\alpha_n)$ (§2), named (there are
several other names in use) and systematically exploited in the magisterial
two-volume book on OPUC ([Si4], [Si5]) by Barry Simon. See also his survey
paper [Si3], written from the point of view of analysis and spectral theory,
the survey [GoTo], and his recent book [Si9].

Complementary to this is our own viewpoint, which comes from proba- 
bility and statistics, specifically time series (as does the excellent survey of 
1986 by Bloomfield [Bl3]). Here we have a stochastic process (random phenomenon
unfolding with time) $X = (X_n)$ with n integer (time discrete, as
here, corresponds to compactness of the unit circle by Fourier duality, whence 
the relevance of OPUC; continuous time is also important, and corresponds 
to OPRL). 

We make a simplifying assumption, and restrict attention to the stationary
case. The situation is then invariant under the shift $n \mapsto n + 1$, which
makes available the powerful mathematical machinery of Beurling's work on
invariant subspaces ([Beu]; [Nik1]). While this is very convenient mathematically,
it is important to realize that this is both a strong restriction and one
unlikely to be satisfied exactly in practice. One of the great contributions 
of the statistician and econometrician Sir Clive Granger (1934-2009) was to 
demonstrate that statistical/econometric methods appropriate for stationary
situations can, when applied indiscriminately to non-stationary situations,
lead to misleading conclusions (via the well-known statistical problem
of spurious regression). This has profound implications for macroeconomic 
policy. Governments depend on statisticians and econometricians for advice 
on interpretation of macroeconomic data. When this advice is misleading 
and mistaken policy decisions are implemented, avoidable economic losses (in 
terms of GDP) may result which are large-scale and permanent (cf. Japan's 
'lost decade' in the 1990s, or lost two decades, and the global problems of 
2007-8 on). 

The mathematical machinery needed for OPUC is function theory on the 
(unit) disc, specifically the theory of Hardy spaces and Beurling's theorem 
(factorization into inner and outer functions and Blaschke products). We 
shall make free use of this, referring for what we need to standard works 
(we recommend [Du], [Ho], [Gar], [Koo1], [Nik1], [Nik2]), but giving detailed
references. The theory on the disc (whose boundary, the circle, is compact)
corresponds analytically to the theory on the upper half-plane, whose boundary,
the real line, is non-compact (for which see e.g. [DymMcK]). Probabilistically,
we work on the disc in discrete time and the half-plane in continuous
time. In each case, what dominates is an integrability condition. In discrete
time, this is Szego's condition (Sz), or non-determinism (ND): integrability
of the logarithm $\log w$ of the spectral density $w$ (of $\mu$) (§3). In continuous
time, this is the logarithmic integral, which gives its name to Koosis' book 
[Koo2]. 

In view of the above, the natural context in which to work is that of 
complex-valued stochastic processes, rather than real-valued ones, in discrete
time. We remind the reader that here the Cauchy-Schwarz inequality tells
us that correlation coefficients lie in the unit disc, rather than the interval
$[-1, 1]$.

The time-series aspects here go back at least as far as the work of Wiener 
[Wi1] in 1932 on generalized harmonic analysis, GHA (which, incidentally,
contains a good historical account of the origins of spectral methods, e.g.
in the work of Sir Arthur Schuster in the 1890s on heliophysics). During
World War II, the linear filter (linearity is intimately linked with Gaussian- 
ity) was developed independently by Wiener in the USA [Wi2], motivated by 
problems of automatic fire control for anti-aircraft artillery, and Kolmogorov 
in Russia (then USSR) [Kol]. This work was developed by the Ukrainian 
mathematician M. G. Krein over the period 1945-1985 (see e.g. [Dym]), by 
Wiener in the 1950s ([Wi3], IG, including commentaries) and by I. A. Ibrag- 
imov (1968 on). 

The subject of time series is of great practical importance (e.g. in econometrics),
but suffered within statistics by being regarded as 'for experts only'.
This changed with the 1970 book by Box and Jenkins (see [BoxJeRe]), which 
popularized the subject by presenting a simplified account (including an
easy-to-follow model-fitting and model-checking recipe), based on ARMA models
(AR for autoregressive, MA for moving average). The ARMA approach is 
still important; see e.g. Brockwell and Davis [BroDav] for a modern textbook 
account. The realization that the Verblunsky coefficients $\alpha$ of OPUC are actually
the partial autocorrelation function (PACF) of time series opened the
way for the systematic exploitation of OPUC within time series by a number 
of authors. These include Inoue, in a series of papers from 2000 on (see es- 
pecially [In3] of 2008), and Inoue and Kasahara from 2004 on (see especially 
[InKa2] of 2006). 

Simon's work ([Si3], [Si4], [Si5]) focusses largely on four conditions, two 
weak (and comparable) and two strong (and non-comparable). Our aim 
here is to complement the expository account in [Si3] by adding the time- 
series viewpoint. This necessitates adding (at least) five new conditions. 
Four of these (comparable) we regard as intermediate, the fifth as strong. 
In our view, one needs three levels of strength here, not two. One is re- 
minded of the Goldilocks principle (from the English children's story: not 
too hot/hard/high/..., not too cold/soft/low/..., but just right). 

We begin in §2 by presenting the basics (Verblunsky's theorem, PACF). 
We turn in §3 to weak conditions (Szego's condition (Sz), or (ND); Szego's
theorem; $\alpha \in \ell_2$; $\sigma > 0$). In §4 we look at our first strong condition, Baxter's
condition (B), and Baxter's theorem ($\alpha \in \ell_1$). The satisfaction or otherwise
of Baxter's condition (B) marks the transition between short- and long-range
dependence. The second strong condition, the strong Szego condition (sSz),
follows in §5 (strong Szego limit theorem, Ibragimov's theorem, Golinskii-Ibragimov
theorem, Borodin-Okounkov formula; $\alpha \in H^{1/2}$), together with
a weakening of (sSz), absolute regularity. We turn in §6 to intermediate 
conditions: in decreasing order of strength, (i) complete regularity; (ii) posi- 
tive angle (Helson-Szego, Helson-Sarason and Sarason theorems); (iii) (pure) 
minimality (Kolmogorov); (iv) rigidity (Sarason), Levinson-McKean condi- 
tion (LM), complete non-determinism (CND), intersection of past and future 
(IPF); see [KaBi] for details. We close in §7 with some remarks. 

The (weak) Szego limit theorem dates from 1915 [Sz1], the strong Szego
limit theorem from 1952 [Sz5]. Simon ([Si4], 11) rightly says how remarkable
it is for one person to have made major contributions to the same area
37 years apart. We note that Szego's remarkable longevity here is actually
exceeded (over the 40 years 1945-1985) by that of the late, great Mark Grigorievich
Krein (1907-1989).

What follows is a survey of this area, which contains (at least) eight dif- 
ferent layers, of increasing (or decreasing) generality. This is an increase on 
Simon's (basic minimum of) four. We hope that no one will be deterred by 
this increase in dimensionality, and so in apparent complexity. Our aim is the 
precise opposite: to open up this fascinating area to a broader mathematical 
public, including the time-series, probabilistic and statistical communities. 
For this, one needs to open up the 'grey zone' between the strong and weak 
conditions, and examine the third category, of intermediate conditions. We
focus on these three levels of generality. This largely reduces the effective di- 
mensionality to three, which we feel simplifies matters. Mathematics should 
be made as simple as possible, but not simpler (to adapt Einstein's immortal 
dictum about physics). 

We close by quoting Barry Simon ([Si8], 85): "It's true that until Eu- 
clidean Quantum Field Theory changed my tune, I tended to think of prob- 
abilists as a priesthood who translated perfectly simple functional analytic 
ideas into a strange language that merely confused the uninitiated." He continues:
in his 1974 book on Euclidean Quantum Field Theory, "the dedication
says: 'To Ed Nelson who taught me how unnatural it is to view probability
theory as unnatural'".

§2. Verblunsky's theorem and partial autocorrelation. 

Let $X = (X_n : n \in \mathbb{Z})$ be a discrete-time, zero-mean, (wide-sense) stationary
stochastic process, with autocovariance function $\gamma = (\gamma_n)$,

$\gamma_n := E[X_n \overline{X_0}]$

(the variance is constant by stationarity, so we may take it as 1, and then
the autocovariance reduces to the autocorrelation).

Let $\mathcal{H}$ be the Hilbert space spanned by $X = (X_n)$ in the $L_2$-space of the
underlying probability space, with inner product $(X, Y) := E[X\overline{Y}]$ and norm
$\|X\| := [E(|X|^2)]^{1/2}$. Write T for the unit circle, the boundary of the unit
disc D, parametrised by $z = e^{i\theta}$; unspecified integrals are over T.

Theorem 1 (Kolmogorov Isomorphism Theorem). There is a process
Y on T with orthogonal increments and a probability measure $\mu$ on T with
(i)

$X_n = \int e^{in\theta}\,dY(\theta)$;

(ii)

$E[dY(\theta)^2] = d\mu(\theta)$.

(iii) The autocorrelation function $\gamma$ then has the spectral representation

$\gamma_n = \int e^{in\theta}\,d\mu(\theta)$.

(iv) One has the Kolmogorov isomorphism between $\mathcal{H}$ (the time domain) and
$L_2(\mu)$ (the frequency domain) given by

$X_t \leftrightarrow e^{it\cdot}$    (KIT)

for integer t (as time is discrete).

Proof. Parts (i), (ii) are the Cramer representation of 1942 ([Cra], [Do] X.4;
Cramer and Leadbetter [CraLea] §7.5). Part (iii), due originally to Herglotz
in 1911, follows from (i) and (ii) ([Do] X.4, [BroDav] §4.3). Part (iv) is due
to Kolmogorov in 1941 [Kol]. All this rests on Stone's theorem of 1932, giving
the spectral representation of groups of unitary transformations of linear
operators on Hilbert space; see [Do] 636-7 for a historical account and references
(including work of Khintchine in 1934 in continuous time), [DunSch]
X.5 for background on spectral theory. //

The reader will observe the link between the Kolmogorov Isomorphism
Theorem and (ii), and its later counterpart from 1944, the Ito Isomorphism
Theorem and $(dB_t)^2 = dt$ in stochastic calculus.

To avoid trivialities, we suppose in what follows that $\mu$ is non-trivial, that is,
has infinite support.

Since for integer t the $e^{it\theta}$ span polynomials in $e^{i\theta}$, prediction theory for
stationary processes reduces to approximation by polynomials. This is the
classical approach to the main result of the subject, Szego's theorem (§3
below); see e.g. [GrSz], Ch. 3, [Ach], Addenda, B. We return to this in §7.7
below.

We write

$d\mu(\theta) = w(\theta)\,d\theta/2\pi + d\mu_s(\theta)$,

so w is the spectral density (w.r.t. normalized Lebesgue measure) and $\mu_s$ is
the singular part of $\mu$.
By stationarity,

$E[X_m \overline{X_n}] = \gamma_{|m-n|}$.

The Toeplitz matrix for X, or $\mu$, or $\gamma$, is

$\Gamma := (\gamma_{ij})$, where $\gamma_{ij} := \gamma_{|i-j|}$.

It is positive definite.

For $n \in \mathbb{N}$, write $\mathcal{H}_{[-n,-1]}$ for the subspace of $\mathcal{H}$ spanned by $\{X_{-n}, \ldots, X_{-1}\}$
(the finite past at time 0 of length n), $P_{[-n,-1]}$ for projection onto $\mathcal{H}_{[-n,-1]}$
(thus $P_{[-n,-1]}X_0$ is the best linear predictor of $X_0$ based on the finite past),
$P^\perp_{[-n,-1]} := I - P_{[-n,-1]}$ for the orthogonal projection (thus $P^\perp_{[-n,-1]}X_0 :=
X_0 - P_{[-n,-1]}X_0$ is the prediction error). We use a similar notation for prediction
based on the infinite past. Thus $\mathcal{H}_{(-\infty,-1]}$ is the closed linear span
(cls) of $X_k$, $k \le -1$, $P_{(-\infty,-1]}$ is the corresponding projection, and similarly
for other time-intervals. Write

$\mathcal{H}_n := \mathcal{H}_{(-\infty,n]}$

for the (subspace generated by) the past up to time n,

$\mathcal{H}_{-\infty} := \bigcap_{n=-\infty}^{\infty} \mathcal{H}_n$

for their intersection, the (subspace generated by) the remote past. With
$\mathrm{corr}(Y, Z) := E[Y\overline{Z}] / \sqrt{E[|Y|^2]\,E[|Z|^2]}$ for Y, Z zero-mean and not a.s. 0,
write also

$\alpha_n := \mathrm{corr}(X_n - P_{[1,n-1]}X_n,\; X_0 - P_{[1,n-1]}X_0)$

for the correlation between the residuals at times 0, n resulting from (linear)
regression on the intermediate values $X_1, \ldots, X_{n-1}$. The sequence

$\alpha = (\alpha_n)_{n=1}^{\infty}$

is called the partial autocorrelation function (PACF). It is also called the 
sequence of Verblunsky coefficients, for reasons which will emerge below. 

Theorem 2 (Verblunsky's Theorem). There is a bijection between the
sequences $\alpha = (\alpha_n)$ with each $\alpha_n \in D$ and the probability measures $\mu$ on T.

This result dates from Verblunsky in 1936 [V2], in connection with OPUC.
It was re-discovered long afterwards by Barndorff-Nielsen and Schou [BarN-S]
in 1973 and Ramsey [Ram] in 1974, both in connection with parametrization
of time-series models in statistics. The Verblunsky bijection has the great
advantage to statisticians of giving an unrestricted parametrization: the only
restrictions on the $\alpha_n$ are the obvious ones resulting from their being correlations:
$|\alpha_n| \le 1$, or as $\mu$ is non-trivial, $|\alpha_n| < 1$. By contrast, $\gamma = (\gamma_n)$
gives a restricted parametrization, in that the possible values of $\gamma_n$ are restricted
by the inequalities of positive-definiteness (principal minors of the
Toeplitz matrix $\Gamma$ are positive). This partly motivates the detailed study of
the PACF in, e.g., [In1], [In2], [In3], [InKa1], [InKa2]. For general statistical
background on partial autocorrelation, see e.g. [KenSt], Ch. 27 (Vol. 2),
§46.26-28 (Vol. 3).

As we mentioned in §1, the basic result for OPUC corresponding to
Favard's theorem for OPRL is Szego's recurrence (or recursion): given a
probability measure $\mu$ on T, let $\Phi_n$ be the monic orthogonal polynomials
it generates (by Gram-Schmidt orthogonalization). For every polynomial
$Q_n$ of degree n, write

$Q_n^*(z) := z^n \overline{Q_n(1/\bar{z})}$

for the reversed polynomial. Then the Szego recursion is

$\Phi_{n+1}(z) = z\Phi_n(z) - \bar{\alpha}_{n+1}\Phi_n^*(z)$,

where the parameters $\alpha_n$ lie in D:

$|\alpha_n| < 1$,

and are the Verblunsky coefficients (also known variously as the Szego, Schur,
Geronimus and reflection coefficients; see [Si4], §1.1). The double use of the
name Verblunsky coefficients and the notation $\alpha = (\alpha_n)$ for the PACF and
the coefficients is justified: the two coincide. Indeed, the Szego recursion is
known in the time-series literature as the Durbin-Levinson algorithm; see e.g.
[BroDav], §§3.4, 5.2. The term Verblunsky coefficient is from Simon [Si4], to
which we refer repeatedly. We stress that Simon writes $\alpha_n$ for our $\alpha_{n+1}$, and
so has n = 0, 1, ... where we have n = 1, 2, .... Our notational convention is
already established in the time-series literature (see e.g. [BroDav], §§3.4, 5.2),
and is more convenient in our context of the PACF, where n = 1, 2, ... has the
direct interpretation as a time-lag between past and future (cf. [Si4], (1.5.15),
p. 56-57). See [Si4], §1.5 and (for two proofs of Verblunsky's theorem)
§1.7, 3.1, and [McLZ] for a recent application of the unrestricted PACF
parametrization.

One may partially summarize the distributional aspects of Theorems 1 
and 2 by the one-one correspondences 



$\alpha \leftrightarrow \mu \leftrightarrow \gamma$.

The Durbin-Levinson algorithm

Write

$\hat{X}_{n+1} := \varphi_{n1}X_n + \ldots + \varphi_{nn}X_1$

for the best linear predictor of $X_{n+1}$ given $X_n, \ldots, X_1$,

$v_n := E[(X_{n+1} - \hat{X}_{n+1})^2] = E[(X_{n+1} - P_{[1,n]}X_{n+1})^2]$

for the mean-square error in the prediction of $X_{n+1}$ based on $X_1, \ldots, X_n$,

$\varphi_n := (\varphi_{n1}, \ldots, \varphi_{nn})^T$    (fpc)

for the vector of finite-predictor coefficients. The Durbin-Levinson algorithm
([Lev], [Dur]; [BroDav] §5.2, [Pou] §7.2) gives the $\varphi_{n+1}$, $v_{n+1}$ recursively, in
terms of quantities known at time n, as follows:

(i) The component $\varphi_{n+1,n+1}$ of $\varphi_{n+1}$ is given by

$\varphi_{n+1,n+1} = [\gamma_{n+1} - \sum_{j=1}^{n} \varphi_{nj}\gamma_{n+1-j}] / v_n$.

The $\varphi_{nn}$ are the Verblunsky coefficients $\alpha_n$:

$\varphi_{nn} = \alpha_n$.

(ii) The remaining components are given by

$(\varphi_{n+1,1}, \ldots, \varphi_{n+1,n})^T = (\varphi_{n1}, \ldots, \varphi_{nn})^T - \varphi_{n+1,n+1}(\varphi_{nn}, \ldots, \varphi_{n1})^T = (\varphi_{n1}, \ldots, \varphi_{nn})^T - \alpha_{n+1}(\varphi_{nn}, \ldots, \varphi_{n1})^T$.

(iii) The prediction errors are given recursively by

$v_0 = 1$, $v_{n+1} = v_n[1 - |\varphi_{n+1,n+1}|^2] = v_n[1 - |\alpha_{n+1}|^2]$.

In particular, $v_n > 0$, and we have from (ii) that

$\varphi_{nj} - \varphi_{n+1,j} = \alpha_{n+1}\varphi_{n,n+1-j}$.    (DL)

Since by (iii)

$v_n = \prod_{j=1}^{n} [1 - |\alpha_j|^2]$,

the n-step prediction error variance $v_n \to \sigma^2 > 0$ iff the infinite product
converges, that is, $\alpha \in \ell_2$, an important condition that we will meet in §3
below in connection with Szego's condition.

Note. 1. The Durbin-Levinson algorithm is related to the Yule-Walker equations
of time-series analysis (see e.g. [BroDav], §8.1), but avoids the need
there for matrix inversion.

2. The computational complexity of the Durbin-Levinson algorithm grows 
quadratically, rather than cubically as one might expect; see e.g. Golub and 
van Loan [GolvL], §4.7. Its good numerical properties result from efficient
use of the Toeplitz character of the matrix $\Gamma$ (or equivalently, of Szego recursion).

3. See [KatSeTe] for a recent approach to the Durbin-Levinson algorithm, 
and [Deg] for the multivariate case. 
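
As an illustration, the recursion (i)-(iii) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it treats the real-valued case (so no complex conjugates appear), it uses the index convention of [BroDav] §5.2 in step (i), and the function name `durbin_levinson` and the AR(1)-type test sequence $\gamma_n = \rho^{|n|}$ are our own illustrative choices, not taken from the references.

```python
def durbin_levinson(gamma):
    """Durbin-Levinson recursion for real autocovariances gamma[0..N].

    Returns (alpha, v, phi), where alpha[n-1] is the PACF value alpha_n,
    v[n] is the prediction error variance v_n, and phi is the final
    vector (phi_{N1}, ..., phi_{NN}) of finite-predictor coefficients.
    """
    N = len(gamma) - 1
    v = [gamma[0]]   # v_0 = gamma_0 (equal to 1 for an autocorrelation)
    alpha = []
    phi = []         # phi_n as a list of length n; empty for n = 0
    for n in range(N):
        # Step (i): phi_{n+1,n+1} = [gamma_{n+1} - sum_j phi_{nj} gamma_{n+1-j}] / v_n
        acc = gamma[n + 1] - sum(phi[j - 1] * gamma[n + 1 - j]
                                 for j in range(1, n + 1))
        a = acc / v[n]
        # Step (ii): phi_{n+1,j} = phi_{nj} - alpha_{n+1} * phi_{n,n+1-j}
        phi = [phi[j - 1] - a * phi[n - j] for j in range(1, n + 1)] + [a]
        alpha.append(a)
        # Step (iii): v_{n+1} = v_n * (1 - alpha_{n+1}^2)
        v.append(v[n] * (1.0 - a * a))
    return alpha, v, phi


# Hypothetical example: gamma_n = rho^|n| (an AR(1)-type autocorrelation).
rho = 0.6
gamma = [rho ** k for k in range(6)]
alpha, v, phi = durbin_levinson(gamma)
```

For this example one expects $\alpha_1 = \rho$ and $\alpha_n = 0$ for $n \ge 2$ (up to rounding), so that $v_n = 1 - \rho^2$ for all $n \ge 1$. Each pass of the loop costs O(n) operations, which is the source of the quadratic overall cost mentioned in Note 2.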

Stochastic versus non-stochastic 

This paper studies prediction theory for stationary stochastic processes. 
As an extreme example (in which no prediction is possible), take the 'free' 
case, in which the X n are independent (and identically distributed). Then 
$d\mu(\theta) = d\theta/2\pi$, $\gamma_n = \delta_{n0}$, $\alpha_n = 0$, $\Phi_n(z) = z^n$ ([Si4], Ex. 1.6.1).

In contrast to this is the situation where $X = (X_n)$ is non-stochastic:
deterministic, but (typically) chaotic. This case often arises in non-linear
time-series analysis and dynamical systems; for a monograph treatment, see 
Kantz and Schreiber [KanSch]. 

One natural way to classify results on OPUC is by the strength of the 
conditions that they impose. Simon's book discusses a range of conditions, 
starting with a fairly weak one, Szego's condition ([Si4] Ch. 2 and §3 below), 
and proceeding to two principal stronger ones, Baxter's condition ([Si4] Ch.
5 and §4 below) and the strong Szego condition ([Si4] Ch. 6 and §5 below).
From a probabilistic viewpoint, equally important are a range of intermediate
conditions not discussed in Simon's book. These we discuss in §6. We
close with some remarks in §7.

§3. Weak conditions: Szego's theorem. 

Rakhmanov's Theorem 

One naturally expects that the influence of the distant past decays with
increasing lapse of time. So one wants to know when

$\alpha_n \to 0 \quad (n \to \infty)$.

By Rakhmanov's theorem ([Rak]; [Si5] Ch 9, and Notes to §9.1, [MatNeTo]),
this happens if the density w of the absolutely continuous component $\mu_a$ is
positive on a set of full measure:

$|\{\theta : w(\theta) > 0\}| = 1$

(using normalized Lebesgue measure, or $2\pi$ using Lebesgue measure).

Non-determinism and the Wold decomposition. 

Write $\sigma^2$ for the one-step mean-square prediction error:

$\sigma^2 := E[(X_0 - P_{(-\infty,-1]}X_0)^2]$;

by stationarity, this is the $\sigma^2 = \lim_{n\to\infty} v_n$ above. Call X non-deterministic
(ND) if $\sigma > 0$, deterministic if $\sigma = 0$. (This usage is suggested by the usual
one of non-randomness being zero-variance, though here a deterministic
process may be random, but independent of time, so the stochastic process 
reduces to a random variable.) The Wold decomposition (von Neumann [vN]
in 1929, Wold [Wo] in 1938; see e.g. Doob [Do], XII.4, Hannan [Ha1], Ch.
III) expresses a process X as the sum of a non-deterministic process U and
a deterministic process V:

$X_n = U_n + V_n$;

the process U is a moving average,

$U_n = \sum_{j=-\infty}^{n} m_{n-j}\xi_j = \sum_{k=0}^{\infty} m_k\xi_{n-k}$,

with the $\xi_j$ zero-mean and uncorrelated, with each other and with V; $E[\xi_n] = 0$,
$\mathrm{var}(\xi_n) = E[\xi_n^2] = \sigma^2$. Thus when $\sigma = 0$ the $\xi_n$ are 0, U is missing and the
process is deterministic. When $\sigma > 0$, the spectral measures of $U_n$, $V_n$ are
$\mu_{ac}$ and $\mu_s$, the absolutely continuous and singular components of $\mu$. Think
of $\xi_n$ as the 'innovation' at time n: the new random input, a measure of
the unpredictability of the present from the past. This is only present when
$\sigma > 0$; when $\sigma = 0$, the present is determined by the past, even by the
remote past.

The Wold decomposition arises in operator theory ([vN]; Sz.-Nagy and 
Foias in 1970 [SzNF], Rosenblum and Rovnyak in 1985 [RoRo], §1.3, [Nik2]), 
as a decomposition into the unitary and completely non-unitary (cnu) parts. 



Szego's Theorem 



Theorem 3 (Szego's Theorem).

(i) $\sigma > 0$ iff $\log w \in L_1$, that is,

$\int \log w(\theta)\,d\theta > -\infty$.    (Sz)

(ii) $\sigma > 0$ iff $\alpha \in \ell_2$.

(iii) $\sigma^2 = \prod_{n=1}^{\infty} (1 - |\alpha_n|^2)$,

so $\sigma > 0$ iff the product converges, i.e. iff

$\sum |\alpha_n|^2 < \infty$: $\alpha \in \ell_2$;

(iv) $\sigma^2$ is the geometric mean of $\mu$:

$\sigma^2 = \exp(\frac{1}{2\pi}\int \log w(\theta)\,d\theta) =: G(\mu) > 0$.    (K)

Proof. Parts (i), (ii) are due to Szego [Sz2], [Sz3] in 1920-21, with $\mu$ absolutely
continuous, and to Verblunsky [V2] in 1936 for general $\mu$. See [Si4]
Ch. 2, [Si9] Ch. 2. Parts (iii) and (iv) are due to Kolmogorov in 1941 [Kol].
Thus (K) is called Kolmogorov's formula. The alternative name for Szego's
condition (Sz) is the non-determinism condition (ND), above. //
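
Kolmogorov's formula (K) is easy to check numerically in a simple case. The following Python sketch is an illustration under our own assumptions: the helper names are hypothetical, the integral is approximated by an equally spaced rectangle rule, and the test density is the Poisson-kernel density $w(\theta) = (1 - \rho^2)/|1 - \rho e^{i\theta}|^2$, whose autocorrelation is $\gamma_n = \rho^{|n|}$ and whose PACF is $\alpha_1 = \rho$, $\alpha_n = 0$ for $n \ge 2$.

```python
import math


def geometric_mean(w, num_points=4096):
    """Approximate exp((1/2pi) * integral of log w(theta) d theta)
    by averaging log w over an equally spaced grid on [0, 2pi)."""
    total = 0.0
    for k in range(num_points):
        theta = 2.0 * math.pi * k / num_points
        total += math.log(w(theta))
    return math.exp(total / num_points)


def poisson_density(rho):
    """Density w.r.t. d theta / 2pi whose Fourier coefficients are rho^|n|."""
    def w(theta):
        return (1.0 - rho * rho) / (1.0 - 2.0 * rho * math.cos(theta) + rho * rho)
    return w


rho = 0.6
# Right-hand side of (K): the geometric mean of the spectral density.
sigma2_from_density = geometric_mean(poisson_density(rho))
# Product of (1 - alpha_n^2) with alpha_1 = rho and alpha_n = 0 for n >= 2.
sigma2_from_pacf = 1.0 - rho ** 2
```

The two quantities should agree to high accuracy, since the rectangle rule converges very fast for smooth periodic integrands such as this one.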



We now restrict attention to processes for which Szego's condition holds; 
indeed, we shall move below to stronger conditions. 

The original motivation of Szego, and later Verblunsky, was approximation
theory, specifically approximation by polynomials. The Kolmogorov
Isomorphism Theorem allows us to pass from finite sections of the past
to polynomials; denseness of polynomials allows prediction with zero error (a
'bad' situation: determinism), which happens iff (Sz) does not hold. There
is a detailed account of the (rather involved) history here in [Si4] §2.3. Other
classic contributions include work of Krein in 1945, Levinson in 1947 [Lev]
and Wiener in 1949 [Wi2]. See [BroDav] §5.8 (where un-normalized Lebesgue
measure is used, so there is an extra factor of $2\pi$ on the right of (K)), [Roz]
§II.5 from the point of view of time series, [Si4] for OPUC.

Pure non-determinism, (PND)

When the remote past is trivial,

$\mathcal{H}_{-\infty} := \bigcap_{n=-\infty}^{\infty} \mathcal{H}_n = \{0\}$,    (PND)

there is no deterministic component in the Wold decomposition, and no singular
component in the spectral measure. The process is then called purely
non-deterministic. Thus

(PND) = (ND) + ($\mu_s = 0$) = (Sz) + ($\mu_s = 0$) = ($\sigma > 0$) + ($\mu_s = 0$)    (PND)

(usage differs here: the term 'regular' is used for (PND) in [IbRo], IV.1, but
for (ND) in [Do], XII.2).

The Szego function and Hardy spaces 

Szego's theorem is the key result in the whole area, and to explore it 
further we need the Szego function (h, below). For this, we need the language 
and viewpoint of the theory of Hardy spaces, and some of its standard results; 
several good textbook accounts are cited in §1. For $0 < p < \infty$, the Hardy
space $H^p$ is the class of analytic functions f on D for which

$\sup_{r<1} (\frac{1}{2\pi}\int |f(re^{i\theta})|^p\,d\theta)^{1/p} < \infty$.    ($H^p$)

As well as in time series and prediction, as here, Hardy spaces are crucial
for martingale theory (see e.g. [Bin1] and the references there). For an
entertaining insight into Hardy spaces in probability, see Diaconis [Dia].
For non-deterministic processes, define the Szego function h by

$h(z) := \exp(\frac{1}{4\pi}\int (\frac{e^{i\theta}+z}{e^{i\theta}-z}) \log w(\theta)\,d\theta)$  ($z \in D$),    (OF)

(note that in [In1-3], [InKa1,2], [Roz] II.5 an extra factor $\sqrt{2\pi}$ is used on the
right), or equivalently

$H(z) := h^2(z) = \exp(\frac{1}{2\pi}\int (\frac{e^{i\theta}+z}{e^{i\theta}-z}) \log w(\theta)\,d\theta)$  ($z \in D$).

Because $\log w \in L_1$ by (Sz), H is an outer function for $H^1$ (whence the name
(OF) above); see Duren [Du], §2.4. By Beurling's canonical factorization
theorem,

(i) $H \in H^1$, the Hardy space of order 1 ([Du], §2.4), or as $H = h^2$, $h \in H^2$.

(ii) The radial limit

$H(e^{i\theta}) := \lim_{r \uparrow 1} H(re^{i\theta})$

exists a.e., and

$|H(e^{i\theta})| = |h(e^{i\theta})|^2 = w(\theta)$

(thus h may be regarded as an 'analytic square root' of w). See also Hoffman
[Ho], Ch. 3-5, Rudin [Ru], Ch. 17, Helson [He], Ch. 4.
Kolmogorov's formula now reads

$\sigma^2 = m_0^2 = h(0)^2 = G(\mu) = \exp(\frac{1}{2\pi}\int \log w(\theta)\,d\theta)$.    (K)

When $\sigma > 0$, the Maclaurin coefficients $m = (m_n)$ of the Szego function h(z)
are the moving-average coefficients of the Wold decomposition (recall that
the moving-average component does not appear when $\sigma = 0$); see Inoue [In3]
and below. When $\sigma > 0$, $m \in \ell_2$ is equivalent to convergence in mean square
of the moving-average sum $\sum_{j=-\infty}^{n} m_{n-j}\xi_j$ in the Wold decomposition. This is
standard theory for orthogonal expansions; see e.g. [Do], IV.4. Note that a
function being in $H^2$ and its Maclaurin coefficients being in $\ell_2$ are equivalent
by general Hardy-space theory; see e.g. [Ru], 17.10 (see also Th. 17.17 for
factorization), [Du] §1.4, 2.4, [Z2], VII.7.

Simon [Si4], §2.8 - 'Lots of equivalences' - gives Szego's theorem in two 
parts. One ([Si4] Th. 2.7.14) gives twelve equivalences, the other ([Si4], Th. 
2.7.15) gives fifteen; the selection of material is motivated by spectral theory 
[Si5]. Theorem 3 above extends these lists of equivalences, and treats the 
material from the point of view of probability theory. (It does not, however, 
give a condition on the autocorrelation $\gamma = (\gamma_n)$ equivalent to (Sz); this is
one of the outstanding problems of the area.) 

The contrast here with Verblunsky's theorem is striking. In general, one
has unrestricted parametrization: all values $|\alpha_n| < 1$ are possible, for all n. But
under Szego's condition, one has $\alpha \in \ell_2$, and in particular $\alpha_n \to 0$, as in
Rakhmanov's theorem. Thus non-deterministic processes fill out only a tiny
part of the $\alpha$-parameter space $D^\infty$. One may regard this as showing that the
remote past, trivial under (Sz), has a rich structure in general, as follows: 

Szego's alternative (or dichotomy). 
One either has 

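$\log w \in L_1$ and $\mathcal{H}_{-\infty} \ne \mathcal{H}_{-n} \ne \mathcal{H}$

or

$\log w \notin L_1$ and $\mathcal{H}_{-\infty} = \mathcal{H}_{-n} = \mathcal{H}$.

In the former case, $\alpha$ occupies a tiny part $\ell_2$ of $D^\infty$, and the remote past
$\mathcal{H}_{-\infty}$ is identified with $L_2(\mu_s)$. This is trivial iff $\mu_s = 0$; cf. (PND). In the
second case, $\alpha$ occupies all of $D^\infty$, and the remote past is the whole space.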

Szego's dichotomy may be interpreted by analogy with physical systems. 
Some systems (typically, liquids and gases) are 'loose' - left alone, they will 
thermalize, and tend to an equilibrium in which the details of the past history 
are forgotten. By contrast, some systems (typically, solids) are 'tight': for 
example, in tempered steel, the thermal history is locked in permanently by 
the tempering process. Long memory is also important in economics and 
econometrics; for background here, see e.g. [Rob], [TeKi]. 

Note. 1. Our h is the Szego function D of Simon [Si4], (2.4.2), and $-1/h$
(see below) its negative reciprocal $-\Delta$ [Si4], (2.2.92):

$h = D$, $-1/h = -\Delta$

(we use both notations to facilitate comparison between [In1-3], [InKa1,2],
which use h, to within the factor $\sqrt{2\pi}$ mentioned above, and [Si4], our reference
on OPUC, which uses D).

2. Both h and $-1/h$ are analytic and non-vanishing in D. See [Si4], Th.
2.2.14 (for $-1/h$, or $\Delta$), Th. 2.4.1 (for h, or D).

3. That (Sz) implies h = D is in the unit ball of H 2 is in [Si4], Th. 2.4.1. 

4. See de Branges and Rovnyak [dBR] for general properties of such square- 
summable power series. 

5. Our autocorrelation $\gamma$ is Simon's c (he calls our $\gamma_n$, or his $c_n$, the moments
of $\mu$: [Si4], (1.1.20)). Our moving-average coefficients $m = (m_n)$ have no
counterpart in [Si4], and nor do the autoregressive coefficients $r = (r_n)$ or
minimality (see below for these). We will also need the Fourier coefficients of
$\log w$ (known for reasons explained below as the cepstrum), which we write
as $L = (L_n)$ ('L for logarithm': Simon's $L_n$ [Si4], (6.1.13)), and a sequence
$b = (b_n)$, the phase coefficients (Fourier coefficients of $\bar{h}/h$).

6. Lund et al. [LuZhKi] give several properties (monotonicity, convexity
etc.) which one of m, $\gamma$ has iff the other has.

MA($\infty$) and AR($\infty$)

The power series expansion

$h(z) = \sum_{n=0}^{\infty} m_n z^n$  ($z \in D$)

generates the MA($\infty$) coefficients $m = (m_n)$ in the Wold decomposition.
That of

$-1/h(z) = \sum_{n=0}^{\infty} r_n z^n$  ($z \in D$)

generates the AR($\infty$) coefficients $r = (r_n)$ in the (infinite-order) autoregression

$\sum_{j=-\infty}^{n} r_{n-j}X_j + \xi_n = 0$  ($n \in \mathbb{Z}$).    (AR)

See [InKa2] §2, [In3] for background. 
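
The passage from the MA coefficients to the AR coefficients is just power-series inversion, which the following Python sketch illustrates. It is a sketch under our own assumptions: the function name `ar_from_ma` is hypothetical, the input is a finite list of leading Maclaurin coefficients of h with $m_0 \ne 0$ (coefficients beyond the end of the list are treated as zero), and the MA(1)-type example $h(z) = 1 + z/2$ is chosen only because its inverse is a geometric series.

```python
def ar_from_ma(m, num_terms):
    """Return r_0, ..., r_{num_terms-1}, the Maclaurin coefficients of -1/h,
    where h(z) = sum_n m[n] z^n and m[0] != 0.

    Uses the convolution identity sum_{k=0}^{n} m_k g_{n-k} = delta_{n0}
    for the coefficients g of 1/h, and then sets r_n = -g_n.
    """
    g = []
    for n in range(num_terms):
        if n == 0:
            g.append(1.0 / m[0])
        else:
            upper = min(n, len(m) - 1)
            s = sum(m[k] * g[n - k] for k in range(1, upper + 1))
            g.append(-s / m[0])
    return [-c for c in g]


# Hypothetical example: h(z) = 1 + 0.5 z, so 1/h(z) = sum_n (-0.5)^n z^n.
r = ar_from_ma([1.0, 0.5], 4)
```

Here one expects $r_n = -(-1/2)^n$, that is, $r = (-1, 1/2, -1/4, 1/8, \ldots)$. When m is a truncation of an infinite sequence, only the first len(m) values of r computed this way are exact.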

One may thus extend the above list of one-one correspondences, as follows: 

Under (Sz), $\alpha, \mu, \gamma \leftrightarrow m = (m_n) \leftrightarrow h, -1/h \leftrightarrow r = (r_n)$.

Finite and infinite predictor coefficients. 

We met the n-vector $\varphi_n$ of finite-predictor coefficients in (fpc) of §2; we
can extend it to an infinite vector, still denoted $\varphi_n$, by adding zeros. The
corresponding vector $\varphi := (\varphi_1, \varphi_2, \ldots)$ of infinite-predictor coefficients gives
the infinite predictor

$P_{(-\infty,-1]}X_0 = \sum_{j=1}^{\infty} \varphi_j X_{-j}$    (ipc)

([InKa2], (1.4)). One would expect convergence of finite-predictor to infinite-predictor
coefficients; under Szego's condition, one has such convergence in
$\ell_2$ iff (PND), i.e., $\mu_s = 0$:

$\varphi_n \to \varphi$ in $\ell_2$ $\iff$ (PND)

(Pourahmadi [Pou], Th. 7.14).



The Szego limit theorem. 

With $G(\mu)$ as above, write $T_n$ (or $T_n(\gamma)$, or $T_n(\mu)$) for the $n \times n$ Toeplitz
matrix with elements

$T^{(n)}_{ij} := \gamma_{|i-j|}$,

obtained by truncation of the Toeplitz matrix $\Gamma$ (cf. [BotSi2]). Szego's limit
theorem states that, under (Sz), its determinant satisfies

$\frac{1}{n} \log\det T_n \to \log G(\mu)$  ($n \to \infty$)

(note that (Sz) is needed for the right to be defined). A stronger statement,
Szego's strong limit theorem, holds; we defer this till §5.

The Szego limit theorem is used in the Whittle estimator of time-series 
analysis; see e.g. Whittle [Wh], Hannan [Ha2]. 
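
The limit theorem, in its usual form $(1/n)\log\det T_n \to \log G(\mu)$, can be illustrated numerically. The Python sketch below is only an illustration under our own assumptions: the helper name `log_det_toeplitz` is hypothetical, the determinant is obtained by plain Gaussian elimination without pivoting (adequate here because the matrix is positive definite, though cubic in cost), and the example autocorrelation is $\gamma_k = \rho^k$, for which $G(\mu) = 1 - \rho^2$ and $\det T_n = (1 - \rho^2)^{n-1}$.

```python
import math


def log_det_toeplitz(gamma, n):
    """log det of the n x n matrix with entries gamma[|i - j|], computed by
    Gaussian elimination without pivoting (pivots are positive when the
    matrix is positive definite)."""
    a = [[gamma[abs(i - j)] for j in range(n)] for i in range(n)]
    log_det = 0.0
    for k in range(n):
        pivot = a[k][k]
        log_det += math.log(pivot)
        for i in range(k + 1, n):
            factor = a[i][k] / pivot
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
    return log_det


# Hypothetical example: gamma_k = rho^k, so log G(mu) = log(1 - rho^2).
rho = 0.6
n = 40
gamma = [rho ** k for k in range(n)]
lhs = log_det_toeplitz(gamma, n) / n
rhs = math.log(1.0 - rho ** 2)
```

In this example `lhs` equals $((n-1)/n)\log(1 - \rho^2)$ up to rounding, so it differs from `rhs` by a term of order 1/n; that residual is the kind of correction the strong limit theorem quantifies.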

Phase coefficients. 

When the Szego condition (Sz) holds, the Szego function $h(z) = \sum m_n z^n$
is defined. We can then define the phase function $\bar{h}/h$, so called because it
has unit modulus and depends only on the phase or argument of h (Peller
[Pel], §8.5). Its Fourier coefficients $b_n$ are called the phase coefficients. They
are given in terms of $m = (m_n)$ and $r = (r_n)$ by

$b_n := \sum_{k=0}^{\infty} m_k r_{n+k}$  ($n = 0, 1, 2, \ldots$).    (b)

The role of the phase coefficients is developed in [BiInKa]. They are important
in connection with rigidity (§6 below), and Hankel operators [Pel].

Rajchman measures. 

In the Gaussian case, mixing in the sense of ergodic theory holds iff

$$\gamma_n \to 0 \qquad (n \to \infty)$$

([CorFoSi], §14.2, Th. 2). Since (Sz) is $\gamma \in \ell_2$, which implies $\gamma_n \to 0$, this
is even weaker than (Sz). Measures for which this condition holds are called
Rajchman measures (they were studied by A. Rajchman in the 1920s). Here
the continuous singular part $\mu_{cs}$ of $\mu$ is decisive; for a characterization of
Rajchman measures, see Lyons ([Ly1] - [Ly3] and the appendix to [KahSa]).

ARMA(p, q).

The Box-Jenkins ARMA(p, q) methodology ([BoxJeRe], [BroDav]:
autoregressive of order p, moving average of order q - see §6.3 for MA(q))
applies to stationary time series where the roots of the relevant polynomials
lie in the unit disk (see e.g. [BroDav] §3.1). The limiting case, of unit roots,
involves non-stationarity, and so the statistical dangers of spurious regression
(§1); cf. Robinson [Rob], p. 2. We shall meet other instances of unit-root
phenomena later (§6.3).

Szego's theorem and the Gibbs Variational Principle 

We point out that Verblunsky [V2] proved the Gibbs Variational Principle,
one of the cornerstones of nineteenth-century statistical mechanics, for
the Szego integral:

$$\inf_g \Big[ \int e^g \, d\mu \Big/ \exp\Big( \int g \, d\theta/2\pi \Big) \Big] = \exp\Big[ \int \log w(\theta) \, d\theta/2\pi \Big].$$

For details, see e.g. Simon [Si9] §§2.2, 10.6, [Si10], Ch. 16, 17. For background
on the Gibbs Variational Principle, see e.g. Simon [Si1], III.4, Georgii
[Geo], 15.4, Ellis [Ell], III.8.

§4. Strong conditions: Baxter's theorem 

The next result ([Bax1], [Bax2], [Bax3]; [Si4], Ch. 5) gives the first of our
strong conditions.

Theorem 4 (Baxter's theorem). The following are equivalent:

(i) the Verblunsky coefficients (or PACF) are summable,

$$\alpha \in \ell_1; \qquad (B)$$

(ii) the autocorrelations are summable, $\gamma \in \ell_1$, and $\mu$ is absolutely continuous
with continuous positive density:

$$\min_\theta w(\theta) > 0.$$

(Of course, $(\gamma_n)$ summable gives, as the $\gamma_n$ are the Fourier coefficients of
$\mu$, that $\mu$ is absolutely continuous with continuous density $w$; thus $w > 0$
iff $\inf w = \min w > 0$.) We extend this list of equivalences, and bring out
its probabilistic significance, in Theorem 5 below on $\ell_1$ (this is substantially
Theorem 4.1 of [In3]). We call $\alpha \in \ell_1$ (or any of the other equivalences in
Theorem 4) Baxter's condition (whence (B) above). Since $\ell_1 \subset \ell_2$, Baxter's
condition (B) ('strong') implies Szego's condition (Sz) ('weak').

Theorem 5 (Inoue). For a stationary process X, the following are equivalent:

(i) Baxter's condition (B) holds: $\alpha \in \ell_1$.

(ii) $\gamma \in \ell_1$, $\mu_s = 0$ and the spectral density $w$ is continuous and positive.

(iii) (PND) (that is, (Sz) / (ND) + $\mu_s = 0$) holds, and the moving-average
and autoregressive coefficients are summable:

$$m \in \ell_1, \qquad r \in \ell_1.$$

(iv) $m \in \ell_1$, $\mu_s = 0$ and the spectral density $w$ is continuous and positive.

(v) $r \in \ell_1$, $\mu_s = 0$ and the spectral density $w$ is continuous and positive.

Proof.

(i) $\Leftrightarrow$ (ii). This is Baxter's theorem, as above.

(iii) $\Rightarrow$ (iv), (v). By (PND), (Sz) holds, so the non-tangential limit

$$h(e^{i\theta}) = \lim_{r \uparrow 1} h(re^{i\theta}) = \sum_{n=0}^\infty m_n e^{in\theta}$$

exists a.e. But as $m \in \ell_1$, $h(e^{i\theta})$ is continuous, so this holds everywhere.
Since

$$w(\theta) = |h(e^{i\theta})|^2 = |D(e^{i\theta})|^2 = \Big| \sum_{n=0}^\infty m_n e^{in\theta} \Big|^2,$$

$w$ is continuous. Letting $r \uparrow 1$ in (with $z = re^{i\theta}$)

$$h(z)(-1/h(z)) = \Big( \sum_{n=0}^\infty m_n r^n e^{in\theta} \Big) \Big( \sum_{n=0}^\infty r_n r^n e^{in\theta} \Big) = -1$$

gives similarly

$$\Big( \sum_{n=0}^\infty m_n e^{in\theta} \Big) \Big( \sum_{n=0}^\infty r_n e^{in\theta} \Big) = -1.$$

So $h(e^{i\theta})$ has no zeros, so neither does $w$. That is, (iv), (v) hold.

(iv) $\Rightarrow$ (iii). As $w$ is positive and continuous, $w$ is bounded away from 0 and
$\infty$. So $1/w$ is also. So

$$1/w(\theta) = |1/h(e^{i\theta})|^2 = |\Delta(e^{i\theta})|^2 = \Big| \sum_{n=0}^\infty r_n e^{in\theta} \Big|^2,$$

where $\Delta = 1/D$. (See [Si4], Th. 2.2.14, 2.7.15: the condition $\lambda_\infty(.) > 0$
there is (Sz), so holds here.) By Wiener's theorem, the reciprocal of a
non-vanishing absolutely convergent Fourier series is an absolutely convergent
Fourier series (see e.g. [Ru], Th. 18.21). So from $m \in \ell_1$ we obtain $r \in \ell_1$,
whence (iii) (cf. [Berk], p. 493).

(v) $\Rightarrow$ (iii). This follows as above, by Wiener's theorem again.

(iv) $\Rightarrow$ (ii). From the MA($\infty$) representation,

$$\gamma_n = \sum_{k=0}^\infty m_{|n|+k} m_k \qquad (n \in \mathbb{Z}) \qquad (conv)$$

([InKa2], (2.21)). So as $\ell_1$ is closed under convolution, $m \in \ell_1$ implies $\gamma \in \ell_1$,
indeed with

$$\|\gamma\|_1 \le \|m\|_1^2,$$

giving (ii).

(ii) $\Rightarrow$ (v). We have

$$\phi_j = c_0 r_j = \sigma r_j$$

with $\phi_j$ the infinite-predictor coefficients ([InKa2], (3.1)). Then $r \in \ell_1$
follows by the Wiener-Levy theorem, as in Baxter [Bax3], 139-140. //
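The step (iv) implies (ii) rests on the convolution formula (conv) and the resulting norm bound. A minimal numerical sketch follows, assuming a finitely supported real sequence $m$ (so that all sums are finite); the function name and the particular coefficients are hypothetical choices for illustration only.

```python
def autocovariance_from_ma(m, n):
    """gamma_n = sum_k m_{|n|+k} m_k for a finitely supported real sequence m."""
    n = abs(n)
    return sum(m[n + k] * m[k] for k in range(max(0, len(m) - n)))


# Illustrative coefficients (an assumption, not from the sources cited).
m = [1.0, 0.5, -0.25]
lags = range(-(len(m) - 1), len(m))
l1_gamma = sum(abs(autocovariance_from_ma(m, n)) for n in lags)  # ||gamma||_1
l1_m_squared = sum(abs(x) for x in m) ** 2                       # ||m||_1^2
print(l1_gamma, l1_m_squared)
```

With mixed signs in $m$ the inequality is strict, as here; when all products $m_{|n|+k} m_k$ share a sign for each lag, the two sides coincide.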

Note. 1. Under Baxter's condition, both $|h|$ and $|1/h|$ (or $|D|$ and $|\Delta| =
|1/D|$) are continuous and positive on the unit circle. As $h$, $1/h$ are analytic in
the disk, and so attain their maximum modulus on the circle by the maximum
principle,

$$\inf_{\mathbb{D}} |h(.)| > 0, \qquad \inf_{\mathbb{D}} |1/h(.)| > 0$$

(and similarly for $D(.)$, $\Delta(.)$); [Si4], (5.2.3), (5.2.4).

2. The hard part of Baxter's theorem is (ii) $\Rightarrow$ (i), as Simon points out ([Si4],
314).

3. Simon [Si4], Th. 5.2.2 gives twelve equivalences in his final form of Baxter's
theorem. (He does not, however, deal explicitly with $m$ and $r$.)

4. Simon also gives a more general form, in terms of Beurling weights, $\nu$.
The relevant Banach algebras contain the Wiener algebra used above as the
special case $\nu = 1$.

5. The approach of [Si4], §5.1 is via truncated Toeplitz matrices and their
inverses. The method derives, through Baxter's work, from the Wiener-Hopf
technique. This point of view is developed at length in [BotSi1], [BotSi2].
Baxter's motivation was approximation to infinite-past predictors by
finite-past predictors.

Long-range dependence 

In various physical models, the property of long-range dependence (LRD)
is important, particularly in connection with phase transitions (see e.g. [Si1],
Ch. II, [Gri1], Ch. 9, [Gri2], Ch. 5), to which we return below. This is a
spatial property, but applies also in time rather than space, when the term
used is long memory. A good survey of long-memory processes was given by
Cox [Cox] in 1984, and a monograph treatment by Beran [Ber] in 1994. For
more recent work, see [DouOpTa], [Rob], [Gao] Ch. 6, [TeKi], [GiKoSu].

Baxter's theorem is relevant to the definition of LRD recently proposed
independently by Debowski [Deb] and Inoue [In3]: long-range dependence,
or long memory, is non-summability of the PACF:

$$X \text{ has LRD iff } \alpha \notin \ell_1. \qquad (DI)$$

While the broad concept of long memory, or LRD, has long been widely
accepted, authors differed over the precise definition. There were two leading
candidates:

(i) LRD is non-summability of covariances, $\gamma \notin \ell_1$.

(ii) LRD is covariance decaying like a power: $\gamma_n \sim c/n^{1-2d}$ as $n \to \infty$, for
some parameter $d \in (0, 1/2)$ (d for differencing - see below) and constant
$c \in (0, \infty)$ (and so $\sum \gamma_n = \infty$).

Note. 1. In place of (ii), one may require $w(\theta) \sim C/\theta^{2d}$ as $\theta \downarrow 0$, for some
constant $C \in (0, \infty)$. The constants here may be replaced by slowly varying
functions. See e.g. [BinGT] §4.10 for relations between regular variation of
Fourier series and Fourier coefficients.

2. One often encounters, instead of $d \in (0, 1/2)$, a parameter $H = d + \frac12 \in
(1/2, 1)$. This $H$ is the Hurst parameter, named after the classic studies by
the hydrologist Hurst of water flows in the Nile; see [Ber], Ch. 2.

3. For $d \in (0, \frac12)$, $\ell(.)$ slowly varying, the following class of prototypical
long-memory examples is considered in [InKa2], §2.3 (see also [In1], Th. 5.1):

$$\gamma_n \sim \ell(n)^2 B(d, 1-2d)/n^{1-2d},$$

$$m_n \sim \ell(n)/n^{1-d},$$

$$r_n \sim \frac{d \sin(\pi d)}{\pi} \cdot \frac{1}{\ell(n)} \cdot \frac{1}{n^{1+d}}.$$

See the sources cited for inter-relationships between these.

4. In [InKa2], Example 2.6, the class of FARIMA(p,d,q) processes is considered
(obtained from an ARMA(p, q) process by fractional differencing of
order d - see [Hos], [BroDav], [KokTa]). For $d \in (0, 1/2)$ these have long
memory; for $d = 0$ they reduce to the familiar ARMA(p, q) processes.

Li ([Li], §3.4) has recently given a related but different definition of long
memory; we return to this in §5 below.
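To make the power-law behaviour concrete, here is a minimal sketch for the simplest fractionally differenced case. It assumes (as an illustration, not as a statement from the sources cited) the Szego function $h(z) = (1 - z)^{-d}$ with normalization $m_0 = 1$, so that $m_n \sim n^{d-1}/\Gamma(d)$ and $r_n \sim d \, n^{-1-d}/\Gamma(1-d)$ by the standard asymptotics for ratios of Gamma functions. The function name `fractional_ma_ar` is hypothetical.

```python
import math


def fractional_ma_ar(d, n_terms):
    """MA and AR coefficients for h(z) = (1 - z)^(-d), via the binomial
    recursions for (1 - z)^(-d) and (1 - z)^d (m_0 = 1 assumed)."""
    m = [1.0]   # coefficients of (1 - z)^(-d)
    c = [1.0]   # coefficients of (1 - z)^d
    for n in range(1, n_terms):
        m.append(m[-1] * (n - 1 + d) / n)
        c.append(c[-1] * (n - 1 - d) / n)
    r = [-x for x in c]   # -1/h(z) = -(1 - z)^d
    return m, r


d = 0.3
n = 20000
m, r = fractional_ma_ar(d, n + 1)

# Ratios of the exact coefficients to their power-law approximations.
ma_ratio = m[n] * n ** (1 - d) * math.gamma(d)
ar_ratio = r[n] * n ** (1 + d) * math.gamma(1 - d) / d
print(ma_ratio, ar_ratio)

# Partial sums of m keep growing, illustrating non-summability of m.
print(sum(m[:1000]), sum(m))
```

Both ratios are close to 1 for large $n$, while the partial sums of $m$ grow without bound (roughly like $n^d$), in contrast with the summable case treated in Theorem 5.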

§5. Strong conditions: the strong Szego theorem

The work of this section may be motivated by work from two areas of
physics.

1. The cepstrum.

During the Cold War, the problem of determining the signature of the
underground explosion in a nuclear weapon test, and distinguishing it from
that of an earthquake, was very important, and was studied by the American
statistician J. W. Tukey and collaborators. Write $L = (L_n)$, where the $L_n$
are the Fourier coefficients of $\log w$, the log spectral density:

$$L_n := \int \log w(\theta) e^{in\theta} \, d\theta/2\pi.$$

Thus $\exp(L_0)$ is the geometric mean $G(\mu)$. The sequence $L$ is called the
cepstrum, $L_n$ the cepstral coefficients (Simon's notation here is $\hat L_n$; [Si4],
(2.1.14), (6.1.11)); see e.g. [OpSc], Ch. 12. The terminology dates from
work of Bogert, Healy and Tukey of 1963 on echo detection [BogHeTu]; see
McCullagh [McC], Brillinger [Bri] (the term is chosen to suggest both echo
and spectrum, by reversing the first half of the word spectrum; it is
accordingly pronounced with the c hard, like a k).
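The cepstral coefficients can be approximated by sampling $\log w$ on an equispaced grid. The sketch below is a minimal illustration under stated assumptions: $w$ is taken to be even (so the $L_n$ are real), and the example density is the normalized AR(1) density $w(\theta) = (1 - \phi^2)/|1 - \phi e^{i\theta}|^2$, for which $L_0 = \log(1 - \phi^2)$ and $L_k = \phi^k/k$ for $k \ge 1$. The function names and the grid size are hypothetical choices.

```python
import cmath
import math


def cepstral_coefficients(w, n_max, grid=1024):
    """Approximate L_n = (1/2 pi) * integral of log w(theta) e^{i n theta}
    for n = 0..n_max by an equispaced Riemann sum (w assumed even)."""
    thetas = [2.0 * math.pi * j / grid for j in range(grid)]
    logs = [math.log(w(t)) for t in thetas]
    return [
        sum(lw * cmath.exp(1j * n * t) for lw, t in zip(logs, thetas)).real / grid
        for n in range(n_max + 1)
    ]


def ar1_density(theta, phi=0.5):
    """Spectral density of a normalized AR(1) process (illustrative example)."""
    return (1.0 - phi ** 2) / (1.0 - 2.0 * phi * math.cos(theta) + phi ** 2)


L = cepstral_coefficients(ar1_density, 3)
geometric_mean = math.exp(L[0])   # G(mu) = exp(L_0)
print(L, geometric_mean)
```

For smooth periodic integrands the equispaced sum converges very quickly, so a modest grid suffices here; densities with zeros or singularities would need more care.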

2. The strong Szego limit theorem.

This (which gives the weak form on taking logarithms) states (in its
present form, due to Ibragimov) that

$$\frac{\det T_n}{G(\mu)^n} \to E(\mu) := \exp\Big\{ \sum_{k=1}^\infty k L_k^2 \Big\} \qquad (n \to \infty)$$

(of course the sum here must converge; it turns out that this form is
best-possible: the result is valid whenever it makes sense ([Si4], 337)).

The motivation was Onsager's work in the two-dimensional Ising model,
and in particular Onsager's formula, giving the existence of a critical
temperature $T_c$ and the decay of the magnetization as the temperature $T \uparrow T_c$;
see [BotSi2] §5.1, [Si1] II.6, [McCW]. The mechanism was a question by Onsager
(c. 1950) to his Yale colleague Kakutani, who asked Szego ([Si4], 331).

Write $H^{1/2}$ for the subspace of $\ell_2$ of sequences $a = (a_n)$ with

$$\|a\|^2 := \sum_n (1 + |n|) |a_n|^2 < \infty \qquad (H^{1/2})$$

(the function of the '1' on the right is to give a norm; without it, $\|.\|$
vanishes on the constant functions). This is a Sobolev space ([Si4], 329, 337; it
is also a Besov space, whence the alternative notation $B_2^{1/2}$; see e.g. Peller
[Pel], Appendix 2.6 and §7.13). This is the space that plays the role here
of $\ell_2$ in §3 and $\ell_1$ in §4. Note first that, although $\ell_1$ and $H^{1/2}$ are close
in that a sequence $(n^c)$ of powers belongs to both or neither, neither
contains the other (consider $a_n = 1/(n \log n)$; $a_n = 1/\sqrt{n}$ if $n = 2^k$, 0 otherwise).

Theorem 6 (Strong Szego Theorem).

(i) If (PND) holds (i.e. (Sz) = (ND) holds and $\mu_s = 0$), then

$$E(\mu) = \prod_{j=1}^\infty (1 - |\alpha_j|^2)^{-j} = \exp\Big( \sum_{n=1}^\infty n L_n^2 \Big)$$

(all three may be infinite), with the infinite product converging iff the strong
Szego condition

$$\alpha \in H^{1/2}, \qquad (sSz)$$

holds.

(ii) (sSz) holds iff

$$L \in H^{1/2} \qquad (sSz')$$

holds.

(iii) Under (Sz), finiteness of any (all three) of the expressions in (i) forces
$\mu_s = 0$.

Proof. Part (i) is due to Ibragimov ([Si4], Th. 6.1.1), and (ii) is immediate
from this. Part (iii) is due to Golinskii and Ibragimov ([Si4], Th. 6.1.2; cf.
[Si2]). //



Part of Ibragimov's theorem was recently obtained independently by Li
[Li], under the term reflectrum identity (so called because it links the
Verblunsky or reflection coefficients with the cepstrum), based on information theory
- mutual information between past and future. Earlier, Li and Xie [LiXi] had
shown the following:

(i) a process with given autocorrelations $\gamma_0, \ldots, \gamma_p$ with minimal information
between past and future must be an autoregressive model AR(p) of order p;

(ii) a process with given cepstral coefficients $L_0, \ldots, L_p$ with minimal
information between past and future must be a Bloomfield model BL(p) of
order p ([Bl1], [Bl2]), that is, one with spectral density $w(\theta) = \exp\{L_0 +
2 \sum_{k=1}^p L_k \cos k\theta\}$.

Another approach to the strong Szego limit theorem, due to Kac [Kac],
uses the conditions

$$\inf w(.) > 0, \qquad \gamma = (\gamma_n) \in \ell_1, \qquad \gamma \in H^{1/2}$$

(recall that $\ell_1$ and $H^{1/2}$ are not comparable). This proof, from 1954, is linked
to probability theory - Spitzer's identity of 1956, and hence to fluctuation
theory for random walks, for which see e.g. [Ch], Ch. 8.
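The three quantities linked by the strong Szego theorem can be compared numerically. The sketch below is an illustration under stated assumptions: it uses an MA(1) example $X_n = \epsilon_n + \theta \epsilon_{n-1}$ (so $\gamma_0 = 1 + \theta^2$, $\gamma_1 = \theta$, geometric mean $G = 1$, and $L_k = (-1)^{k+1} \theta^k / k$), obtains the PACF by the Durbin-Levinson recursion, and uses the standard fact that $\det T_n$ is the product of the first $n$ one-step prediction error variances. The function name and the truncation levels are hypothetical choices.

```python
import math


def levinson_durbin(gamma, order):
    """Durbin-Levinson recursion. Returns (alphas, sigma2) where
    alphas[j-1] is the PACF at lag j and sigma2[k] is the prediction
    error variance using k past values (sigma2[0] = gamma[0])."""
    phi_prev = []
    alphas = []
    sigma2 = [gamma[0]]
    for n in range(1, order + 1):
        num = gamma[n] - sum(phi_prev[j] * gamma[n - 1 - j] for j in range(n - 1))
        a = num / sigma2[-1]
        phi_prev = [phi_prev[j] - a * phi_prev[n - 2 - j] for j in range(n - 1)] + [a]
        alphas.append(a)
        sigma2.append(sigma2[-1] * (1.0 - a * a))
    return alphas, sigma2


theta = 0.5
N = 40
gamma = [1.0 + theta ** 2, theta] + [0.0] * (N - 1)
alphas, sigma2 = levinson_durbin(gamma, N)

G = 1.0  # geometric mean for this MA(1) example (assumption stated above)
det_ratio = math.prod(sigma2[:N]) / G ** N            # det T_N / G^N
pacf_product = math.prod((1.0 - a * a) ** (-j)
                         for j, a in enumerate(alphas, start=1))
cepstral_value = math.exp(sum(k * (theta ** k / k) ** 2 for k in range(1, 200)))
print(det_ratio, pacf_product, cepstral_value)
```

In this example all three printed values agree with $1/(1 - \theta^2)$ up to truncation and rounding error. Only squares of the PACF enter, so the sign convention relating the PACF to the Verblunsky coefficients does not affect the comparison.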

The Borodin-Okounkov formula.

This turns the strong Szego limit theorem above from analysis to algebra
by identifying the quotient on the left there as a determinant which visibly
tends to 1 as $n \to \infty$ [BorOk]; see [Si4] §6.2. (It was published in 2000,
having been previously obtained by Geronimo and Case [GerCa] in 1979; see
[Si4] 337, 344, [Bot] for background here.) In terms of operator theory and
in Widom's notation [Bot], the result is

$$\frac{\det T_n(a)}{G(a)^n} = \frac{\det(I - Q_n H(b) H(c) Q_n)}{\det(I - H(b) H(c))},$$

for $a$ a sufficiently smooth function without zeros on the unit circle and with
winding number 0. Then $a$ has a Wiener-Hopf factorization $a = a_- a_+$; $b :=
a_- a_+^{-1}$, $c := a_-^{-1} a_+$; $H(b)$, $H(c)$ are the Hankel matrices $H(b) = (b_{j+k+1})_{j,k=0}^\infty$,
$H(c) = (c_{-j-k-1})_{j,k=0}^\infty$, and $Q_n$ is the orthogonal projection of $\ell_2(1, 2, \ldots)$
onto $\ell_2(\{n, n+1, \ldots\})$. By Widom's formula,

$$1/\det(I - H(b) H(c)) = \exp\Big\{ \sum_{k=1}^\infty k L_k^2 \Big\} =: E(a)$$

(see e.g. [Si4], Th. 6.2.13), and $Q_n H(b) H(c) Q_n \to 0$ in the trace norm,
whence

$$\det T_n(a)/G(a)^n \to E(a),$$

the strong Szego limit theorem. See [Si4], Ch. 6, [Si6], [BasW], [BotW] (in
[Si4] §6.2 the result is given in OPUC terms; here $b$, $c$ are the phase function
$\bar h/h$ and its inverse).

(B + sSz).

We may have both of the strong conditions (B) and (sSz) (as happens in
Kac's method [Kac], for instance). Matters then simplify, since the spectral
density $w$ is now continuous and positive. So $w$ is bounded away from 0 and
$\infty$, so $\log w$ is bounded. Write

$$\omega_2(\delta, h) := \sup_{|\epsilon| \le \delta} \Big( \int |h(\lambda + \epsilon) - h(\lambda)|^2 \, d\lambda \Big)^{1/2}$$

for the $L_2$ modulus of continuity. Applying [IbRo], IV.4, Lemma 7 to $\log w$,

$$L \in H^{1/2} \iff \sum_{k=1}^\infty \omega_2(1/k, \log w)^2 < \infty,$$

and applying it to $w$,

$$\gamma \in H^{1/2} \iff \sum_{k=1}^\infty \omega_2(1/k, w)^2 < \infty.$$

Thus under (B), $L \in H^{1/2}$ and $\gamma \in H^{1/2}$ become equivalent. This last
condition is Li's proposed definition of long-range dependence:

$$\text{LRD} \iff \gamma \notin H^{1/2} \qquad (Li)$$

([Li], §3.4; compare the Debowski-Inoue definition (DI) above, that LRD iff
$\alpha \notin \ell_1$).

We are now in $W \cap H^{1/2}$, the intersection of $H^{1/2}$ with the Wiener algebra
$W$ (of absolutely convergent Fourier series) relevant to Baxter's theorem as
in §4. As there, we can take inverses, since the Szego function is non-zero
on the circle (cf. [BotSi2], §5.1). One can thus extend Theorem 2 to this
situation, including the cepstral condition $L \in H^{1/2}$ (Li [Li], Th. 1 part 3,
showed that $L \in H^{1/2}$ and $\gamma \in H^{1/2}$ are equivalent if $w$ is continuous and
positive).

$L_\infty$ + (sSz).

The bounded functions in $H^{1/2}$ form an algebra, the Krein algebra $K$, a
Banach algebra under convolution; see Krein [Kr], Bottcher and Silbermann
[BotSi1] Ch. 10, [BotSi2] Ch. 5, [Si4], 344, [BotKaSi]. The Krein algebra
may be used as a partial substitute for the Wiener algebra $W$ used to treat
Baxter's theorem in §4 ($W \cap H^{1/2}$ is also an algebra: [BotSi2], §5.1).

5.1. $\phi$-mixing

Weak dependence may be studied by a hierarchy of mixing conditions;
for background, see e.g. Bradley [Bra1], [Bra2], [Bra3], Bloomfield [Bl3],
Ibragimov and Linnik [IbLi], Ch. 17, Cornfeld et al. [CorFoSi], and in the
Gaussian case Ibragimov and Rozanov [IbRo], Peller [Pel]. We need two
sequences of mixing coefficients:

$$\phi(n) := E \sup\{ |P(A | \mathcal{F}_{-\infty}^0) - P(A)| : A \in \mathcal{F}_n^\infty \};$$

$$\rho(n) := \rho(\mathcal{F}_{-\infty}^0, \mathcal{F}_n^\infty),$$

where

$$\rho(\mathcal{A}, \mathcal{B}) := \sup\{ \|E(f | \mathcal{B}) - Ef\|_2 / \|f\|_2 : f \in L_2(\mathcal{A}) \}.$$

The process is called $\phi$-mixing if $\phi(n) \to 0$ as $n \to \infty$, $\rho$-mixing if $\rho(n) \to 0$.
(The reader is warned that some authors use other letters here - e.g. [IbRo]
uses $\beta$ for our $\phi$; we follow Bradley.)

We quote [Bra1] that $\phi$-mixing implies $\rho$-mixing. We regard the first as
a strong condition, so include it here, but the second and its several weaker
relatives as intermediate conditions, which we deal with in §6 below.

The spectral characterization for $\phi$-mixing is

$$\mu_s = 0, \qquad w(\theta) = |P(e^{i\theta})|^2 w^*(\theta),$$

where $P$ is a polynomial with its roots on the unit circle and the cepstrum
$L^* = (L_n^*)$ of $w^*$ satisfies the strong Szego condition (sSz) ([IbRo] IV.4, p.
129). This is weaker than (sSz). In the Gaussian case, $\phi$-mixing (also known
as absolute regularity) can also be characterized in operator-theoretic terms:
$\phi(n)$ can be identified as $\sqrt{\mathrm{tr}(B_n)}$, where $B_n$ are compact operators with
finite trace, so $\phi$-mixing is $\mathrm{tr}(B_n) \to 0$ ([IbRo], IV.2 Th. 4, IV.3 Th. 6).

§6. Intermediate conditions

We turn now to four intermediate conditions, in decreasing order of strength.

6.1. $\rho$-mixing

The spectral characterization of $\rho$-mixing (also known as complete
regularity) is

$$\mu_s = 0, \qquad w(\theta) = |P(e^{i\theta})|^2 w^*(\theta),$$

where $P$ is a polynomial with its roots on the unit circle and

$$\log w^* = u + \tilde v,$$

with $u$, $v$ real and continuous and $\tilde v$ the conjugate function of $v$ (Sarason
[Sa2]; Helson and Sarason [HeSa]). An alternative spectral characterization is

$$\mu_s = 0, \qquad w(\theta) = |P(e^{i\theta})|^2 w^*(\theta),$$

where $P$ is a polynomial with its roots on the unit circle and for all $\epsilon > 0$,

$$\log w^* = r_\epsilon + u_\epsilon + \tilde v_\epsilon,$$

where $r_\epsilon$ is continuous, $u_\epsilon$, $v_\epsilon$ are real and bounded, and $\|u_\epsilon\| + \|v_\epsilon\| < \epsilon$
([IbRo], V.2 Th. 3; we note here that inserting such a polynomial factor
preserves complete regularity, merely changing $\rho$ - [IbRo] V.1, Th. 1).

6.2. Positive angle: the Helson-Szego and Helson-Sarason conditions.

We turn now to a weaker condition. For subspaces $A$, $B$ of $\mathcal{H}$, the angle
between $A$ and $B$ is defined as

$$\cos^{-1} \sup\{ |(a, b)| : a \in A, b \in B \}.$$

Then $A$, $B$ are at a positive angle iff this supremum is < 1. One says that
the process $X$ satisfies the positive angle condition, (PA), if for some time
lapse $k$ the past $\mathrm{cls}(X_m : m < 0)$ and the future $\mathrm{cls}(X_{k+m} : m > 0)$ are at
a positive angle, i.e. $\rho(0) = \ldots = \rho(k-1) = 1$, $\rho(k) < 1$, which we write as
PA(k) (Helson and Szego [HeSz], $k = 1$; Helson and Sarason [HeSa], $k > 1$).
The spectral characterization of this is

$$\mu_s = 0, \qquad w(\theta) = |P(e^{i\theta})|^2 w^*(\theta),$$

where $P$ is a polynomial of degree $k - 1$ with its roots on the unit circle and

$$\log w^* = u + \tilde v,$$

where $u$, $v$ are real and bounded and $\|v\| < \pi/2$ ([IbRo] V.2, Th. 3, Th. 4).
(The role of $\pi/2$ here stems from Zygmund's theorem of 1929, that if $u$ is
bounded and $\|u\| < \pi/2$, $\exp\{\tilde u\} \in L_1$ ([Z1], [Z2] VII, (2.11), [Tor], V.3: cf.
[Pel] §3.2).) Thus $\rho$-mixing implies (PA) (i.e. PA(k) for some k).

The case PA(k) for $k > 1$ is a unit-root phenomenon (cf. the note at
the end of §3). We may (with some loss of information) reduce to the case
PA(1) by sampling only at every kth time point (cf. [Pel], §§8.5, 12.8). We
shall do this for convenience in what follows.

It turns out that the Helson-Szego condition (PA(1)) coincides with
Muckenhoupt's condition $A_2$ in analysis:

$$\sup_I \Big( \frac{1}{|I|} \int_I w(\theta) \, d\theta \Big) \Big( \frac{1}{|I|} \int_I \frac{1}{w(\theta)} \, d\theta \Big) < \infty, \qquad (A_2)$$

where $|.|$ is Lebesgue measure and the supremum is taken over all
subintervals $I$ of the unit circle $\mathbb{T}$. See e.g. Hunt, Muckenhoupt and Wheeden
[HuMuWh]. With the above reduction of PA to PA(1), we then have
$\rho$-mixing implies PA(1) (= $A_2$).

6.3. Pure minimality

Consider now the interpolation problem, of finding the best linear
interpolation of a missing value, $X_0$ say, from the others. Write

$$H'_n := \mathrm{cls}\{X_m : m \ne n\}$$

for the closed linear span of the values at times other than $n$. Call $X$ minimal
if

$$X_n \notin H'_n,$$

purely minimal if

$$\bigcap_n H'_n = \{0\}.$$

The spectral condition for minimality is (Kolmogorov in 1941, [Kol] §10)

$$1/w \in L_1, \qquad (min)$$

and for pure minimality, $\mu_s = 0$ also (Makagon-Weron in 1976, [MakWe];
Sarason in 1978, [Sa1]; [Pou], Th. 8.10):

$$1/w \in L_1, \qquad \mu_s = 0. \qquad (purmin)$$

Of course ($A_2$) implies $1/w \in L_1$, so the Helson-Szego condition (PA(1)) (or
Muckenhoupt condition ($A_2$)) implies pure minimality. (From $\log x \le x - 1$
for $x > 1$, (min) implies (Sz): both restrict the small values of $w \ge 0$, and
in particular force $w > 0$ a.e.) For background on the implication from the
Helson-Szego condition PA(1) to ($A_2$), see e.g. Garnett [Gar], Notes to Ch.
VI, Treil and Volberg [TrVo2].

Under minimality, the relationship between the moving-average coefficients
$m = (m_n)$ and the autoregressive coefficients $r = (r_n)$ becomes
symmetrical, and one has the following complement to Theorem 4:

Theorem 7 (Inoue). For a stationary process X, the following are equivalent:

(i) The process is minimal.

(ii) The autoregressive coefficients $r = (r_n)$ in (AR) satisfy $r \in \ell_2$.

(iii) $1/h \in H^2$.

Proof. Since

$$1/h(z) = \exp\Big( \frac{1}{4\pi} \int \Big( \frac{e^{i\theta} + z}{e^{i\theta} - z} \Big) \log(1/w(\theta)) \, d\theta \Big) \qquad (z \in \mathbb{D}), \qquad (OF')$$

and $\pm \log w$ are in $L_1$ together, when $1/w \in L_1$ (i.e. the process is minimal)
one can handle $1/w$, $1/h$, $r = (r_n)$ as we handled $w$, $h$ and $m = (m_n)$, giving

$$1/h \in H^2$$

and

$$r = (r_n) \in \ell_2.$$

Conversely, each of these is equivalent to (min); [In1], Prop. 4.2. //

6.4. Rigidity; (LM), (CND), (IPF).

Rigidity; the Levinson-McKean condition.

Call $g \in H^1$ rigid if it is determined by its phase or argument:

$$f \in H^1 \ (f \text{ not identically } 0), \quad f/|f| = g/|g| \text{ a.e.} \quad \Rightarrow \quad f = cg \text{ for some positive constant } c.$$

This terminology is due to Sarason [Sa1], [Sa2]; the alternative terminology,
due to Nakazi, is strongly outer [Na1], [Na2]. One could instead say that such
a function is determined by its phase. The idea originates with de Leeuw and
Rudin [dLR] and Levinson and McKean [LevMcK]. In view of this, we call
the condition that $\mu$ be absolutely continuous with spectral density $w = |h|^2$
with $h^2$ rigid, or determined by its phase, the Levinson-McKean condition,
(LM).

Complete non-determinism; intersection of past and future.

In [InKa2], the following two conditions are discussed:

(i) complete non-determinism,

$$\mathcal{H}_{(-\infty,-1]} \cap \mathcal{H}_{[0,\infty)} = \{0\} \qquad (CND)$$

(for background on this, see [BlJeHa], [JeBl], [JeBlBa]),

(ii) the intersection of past and future property,

$$\mathcal{H}_{(-\infty,-1]} \cap \mathcal{H}_{[-n,\infty)} = \mathcal{H}_{[-n,-1]} \qquad (n = 1, 2, \ldots) \qquad (IPF)$$

These are shown to be equivalent in [InKa2]. In [KaBi], it is shown that both
are equivalent to the Levinson-McKean condition, or rigidity:

$$(LM) \Leftrightarrow (IPF) \Leftrightarrow (CND).$$

These are weaker than pure minimality ([Bl3], §7, [KaBi]). But since (CND)
was already known to be equivalent to (PND) + (IPF), they are stronger
than (PND). This takes us from the weakest of the four intermediate
conditions of this section to the stronger of the weak conditions of §3.

§7. Remarks

1. VMO $\subset$ BMO.

The spectral characterizations given above were mainly obtained before
the work of Fefferman [Fe] in 1971, Fefferman and Stein [FeSt] in 1972 (see
Garnett [Gar], Ch. VI for a textbook account): in particular, they predate
the Fefferman-Stein decomposition of a function of bounded mean oscillation,
$f \in$ BMO, as

$$f = u + \tilde v, \qquad u, v \in L_\infty.$$

This has a complement due to Sarason [Sa3], where $f$ here is in VMO iff $u$,
$v$ are continuous. Sarason also gives ([Sa3], Th. 2) a characterization of his
class of functions of vanishing mean oscillation VMO within BMO related
to Muckenhoupt's condition ($A_2$).

While both components $u$, $v$ are needed here, and may be large in norm,
it is important to note that the burden of being large in norm may be borne by
a continuous function, leaving $u$ and $v$ together to be small in ($L_\infty$) norm (in
particular, less than $\pi/2$). This is the Ibragimov-Rozanov result ([IbRo], V.2
Th. 3), used in §6.1 to show that absolute regularity (§5) implies complete
regularity.

2. $H^{1/2} \subset$ VMO.

The class $H^{1/2}$ is contained densely within VMO (Prop. A2, Boutet
de Monvel-Berthier et al. [BouGePu]). For $H^{1/2}$, one has a version of the
Fefferman-Stein decomposition for BMO:

$$f \in H^{1/2} \iff f = u + \tilde v, \qquad u, v \in H^{1/2} \cap L_\infty$$

([Pel] §7.13).

3. Winding number and index. 

The class $H^{1/2}$ occurs in recent work on topological degree and winding
number; see Brezis [Bre], Bourgain and Kozma [BouKo]. The winding
number also occurs in operator theory as an index in applications of
Banach-algebra methods and the Gelfand transform; see e.g. [Si4], Ch. 5
(cf. Tsirelson [Ts]).

4. Conformal mapping.

The class $H^{1/2}$ also occurs in work of Zygmund on conformal mapping
([Z2], VII.10).

5. Rapid decay and continuability.

Even stronger than the strong conditions considered here in §§4, 5 is
assuming that the Verblunsky coefficients are rapidly decreasing. This is
connected to analytic continuability of the Szego function beyond the unit
disk; see [Si7].

6. Scattering theory. 

The implication from the strong Szego (or Golinskii-Ibragimov) condition
to the Helson-Szego/Helson-Sarason condition (PA) has a recent analogue in
scattering theory (Golinskii et al. [GolKhPeYu], under '(GI) implies (HS)').

7. Wavelets. 

Traditionally, the subject of time series seemed to consist of two
non-intercommunicating parts, 'time domain' and 'frequency domain' (known to
be equivalent to each other via the Kolmogorov Isomorphism Theorem of
§2). The subject seemed to suffer from schizophrenia (see e.g. [BriKri] and
[HaKR]) - though the constant relevance of the spectral or frequency side
to questions involving time directly is well illustrated in the apt title 'Past
and future' of the paper by Helson and Sarason [HeSa] (cf. [Pel] §8.6). This
unfortunate schism has been healed by the introduction of wavelet methods
(see e.g. the standard work Meyer [Me], Meyer and Coifman [MeCo], and in
OPUC, Treil and Volberg [TrVo1]). The practical importance of this may be
seen in the digitization of the FBI's finger-print data-bank (without which
the US criminal justice system would long ago have collapsed). Dealing with
time and frequency together is also crucial in other areas, e.g. in the
high-quality reproduction of classical music.

8. Higher dimensions: matrix OPUC (MOPUC). 

We present the theory here in one dimension for simplicity, reserving
the case of higher dimensions for a sequel [Bin2]. We note here that in
higher dimensions the measure $\mu$ and the Verblunsky coefficients $\alpha_n$ become
matrix-valued (matrix OPUC, or MOPUC), so one loses commutativity. The
multidimensional case is needed for portfolio theory in mathematical finance,
where one holds a (preferably balanced) portfolio of risky assets rather than
one; see e.g. [BinFrKi].

9. Non-commutativity.

Much of the theory presented here has a non-commutative analogue in
operator theory; see Blecher and Labuschagne [BlLa], Bekjan and Xu [BeXu]
and the references cited there.

10. Non-stationarity. 

As mentioned in §1, the question of whether or not the process is stationary
is vitally important, and stationarity is a strong assumption. The basic
Kolmogorov Isomorphism Theorem can be extended beyond the stationary
case in various ways, e.g. to harmonisable processes (see e.g. [Rao]). For
background, and applications to filtering theory, see e.g. [Kak]; for filtering
theory, we refer to e.g. [BaiCr].

11. Continuous time. 

The Szego condition (Sz) for the unit circle (regarded as the boundary
of the unit disc) corresponds to the condition

$$\int_{-\infty}^{\infty} \frac{\log w(x)}{1 + x^2} \, dx > -\infty$$

for the real line (regarded as the boundary of the upper half-plane). This
follows from the Mobius function $w = (z - i)/(z + i)$ mapping the half-plane
conformally onto the disc; see e.g. [Du], 189-190. The consequences of this
condition are explored at length in Koosis' monograph on the 'logarithmic
integral', [Koo2]. Passing from the disc to the half-plane corresponds
probabilistically to passing from discrete to continuous time (and analytically to
passing from Fourier series to Fourier integrals). The probabilistic theory is
considered at length in Dym and McKean [DymMcK].

12. Gaussianity and linearity. 

We have mentioned the close links between Gaussianity and linearity in
§1. For background on Gaussian Hilbert spaces and Fock space, see Janson
[Jan], Peller [Pel]; for extensions to §§5.1, 6 in the Gaussian case, see
[IbRo], [Pel], [Bra1] §5. To return to the undergraduate level of our opening
paragraph: for an account of Gaussianity, linearity and regression, see e.g.
Williams [Wil], Ch. 8, or [BinFr].